Why use R for plotting?
Beautiful and flexible graphics!
Code and HTML available at http://qcbs.ca/wiki/r/workshop3
The ggplot2 package lets you make beautiful and customizable plots of your data. It implements the grammar of graphics, an easy-to-use system for building plots.
Required packages
install.packages("ggplot2")
library(ggplot2)
?qplot
arguments:
data
x
y
…
Look at pre-loaded "iris" dataset:
?iris
head(iris)
str(iris)
names(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
qplot(data = iris,
x = Sepal.Length,
y = Sepal.Width)
qplot(data = iris,
x = Species,
y = Sepal.Width)
arguments:
xlab
ylab
main
qplot(data = iris,
x = Sepal.Length, xlab = "Sepal Length (mm)",
y = Sepal.Width, ylab = "Sepal Width (mm)",
main = "Sepal dimensions")
Produce a basic plot with built in data (5 minutes)
?CO2
data(CO2)
?BOD
data(BOD)
qplot(data = CO2,
x = conc, xlab = "CO2 Concentration (mL/L)",
y = uptake, ylab = "CO2 Uptake (umol/m^2 sec)",
main = "CO2 uptake in grass plants")
qplot(data = CO2, x = conc, y = uptake,
xlab = expression(paste(CO[2]," Concentration (ml/L)")),
ylab = expression(paste(CO[2]," Uptake (", mu, "mol/", m^2," sec)")),
main = expression(paste(CO[2]," uptake in grass plants")))
A graphic is made of elements (layers)
geom_point(): scatterplot
geom_line(): lines connecting points by increasing value of x
geom_path(): lines connecting points in sequence of appearance
geom_boxplot(): box and whiskers plot for categorical variables
geom_bar(): bar charts for categorical x axis
geom_histogram(): histogram for continuous x axis
Edit any single element to produce a new graph, e.g., by changing the coordinate system
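As a small sketch of editing a single element, the boxplot below is built once and then redrawn with its axes swapped by adding coord_flip(). The object names box and box.flipped are illustrative only and do not appear elsewhere in the workshop.

```r
library(ggplot2)

# Box-and-whisker plot of sepal width for each iris species
box <- ggplot(data = iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot()

# Same data and geom, but with the x and y axes swapped
box.flipped <- box + coord_flip()
box.flipped
```

Only the coordinate system changes; the data, the aesthetics and the geom are reused as they are.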
1. Create a simple plot object:
plot.object <- ggplot() OR qplot()
2. Add graphical layers:
plot.object <- plot.object + layer()
3. Repeat step 2 until satisfied, then print:
print(plot.object)
qplot() vs ggplot()
qplot(data = iris,
x = Sepal.Length,
xlab = "Sepal Length (mm)",
y = Sepal.Width,
ylab = "Sepal Width (mm)",
main = "Sepal dimensions")
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
xlab("Sepal Length (mm)") +
ylab("Sepal Width (mm)") +
ggtitle("Sepal dimensions")
basic.plot <- ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_point()+
xlab("Sepal Length (mm)") + ylab("Sepal Width (mm)")+
ggtitle("Sepal dimensions")
basic.plot
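The create, add and print workflow can also be written one step at a time. The following is a minimal sketch using the iris data; step.plot is a hypothetical name chosen for this illustration.

```r
library(ggplot2)

# Step 1: create a plot object holding the data and aesthetic mappings
step.plot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Step 2: add layers and labels one at a time
step.plot <- step.plot + geom_point()
step.plot <- step.plot + ggtitle("Sepal dimensions")

# Step 3: print the finished plot
print(step.plot)
```

Because each addition returns a new plot object, intermediate versions can be kept under different names and extended separately.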
Add aesthetics using aes()
basic.plot <- basic.plot + aes(colour = Species, shape = Species)
basic.plot
Add linear regressions with geom_smooth()
linear.smooth.plot <- basic.plot + geom_smooth(method = "lm", se = FALSE)
linear.smooth.plot
Produce a colourful plot with linear regression (or other smoother) from built in data such as the CO2 dataset or the msleep dataset
?CO2
data(CO2)
?msleep
data(msleep)
Example using loess smoothing
data(CO2)
CO2.plot <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
geom_point() +
xlab("CO2 Concentration (mL/L)") +
ylab("CO2 Uptake (umol/m^2 sec)") +
ggtitle("CO2 uptake in grass plants") +
geom_smooth(method = "loess")
CO2.plot
Data becomes difficult to visualize when there are multiple factors, e.g., the CO2 data set contains data on CO2 uptake for chilled vs non-chilled treatments from two different regions. Let's build a basic plot using this data set:
CO2.plot <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
geom_point() +
xlab("CO2 Concentration (mL/L)") +
ylab("CO2 Uptake (umol/m^2 sec)") +
ggtitle("CO2 uptake in grass plants")
CO2.plot
If we want to compare regions, it is useful to make two panels. Syntax: plot.object + facet_grid(rows ~ columns)
CO2.plot <- CO2.plot + facet_grid(. ~ Type)
CO2.plot
Now that we have 2 facets, let's observe how the CO2 uptake evolves as CO2 concentrations rise, by adding connecting lines to the points using geom_line():
CO2.plot + geom_line()
Wrong! Each treatment in each region has 3 replicate plants, so geom_line() joins the points of all 3 plants into a single zigzagging line.
Specify groups
CO2.plot <- CO2.plot + geom_line(aes(group = Plant))
CO2.plot
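To see why a group is needed, the replication in the CO2 data can be checked with base R. This is a sketch only; plants is a hypothetical helper object.

```r
# One row per plant, with its region (Type) and Treatment
plants <- unique(as.data.frame(CO2)[, c("Plant", "Type", "Treatment")])

# Number of distinct plants in each Type x Treatment combination
table(plants$Type, plants$Treatment)
```

Each cell of the table holds more than one plant, which is why the Plant column has to be supplied as the group aesthetic to draw one line per plant.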
Data Visualization with ggplot2 Cheat Sheet
help(package = ggplot2)
http://ggplot2.tidyverse.org/reference/
Explore a new geom and other plot elements with your own data or built in data.
data(msleep)
data(OrchardSprays)
data(OrchardSprays)
box.plot <- ggplot(data = OrchardSprays, aes(x = treatment, y = decrease)) +
  geom_boxplot()
box.plot
ggsave() writes a plot directly to your working directory in a single call. You can specify the name of the file and the dimensions of the plot:
ggsave("CO2_plot.pdf",
CO2.plot,
height = 8.5,
width = 11,
units = "in")
Note that vector formats (e.g., pdf, svg) are often preferable to raster formats (jpeg, png, …)
Other methods to save images:
?pdf
?jpeg
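The sketch below saves the same plot once as a vector file and once as a raster file. ggsave() picks the graphics device from the file extension. The file names and the temporary directory are assumptions made for this illustration, and the PNG step assumes a working PNG device on your system.

```r
library(ggplot2)

p <- ggplot(data = CO2, aes(x = conc, y = uptake, colour = Treatment)) +
  geom_point()

# Hypothetical file names, written to a temporary directory for this sketch
pdf.file <- file.path(tempdir(), "CO2_plot.pdf")
png.file <- file.path(tempdir(), "CO2_plot.png")

# Vector output: can be rescaled without loss of quality
ggsave(pdf.file, p, width = 11, height = 8.5, units = "in")

# Raster output: the resolution is fixed by dpi when the file is written
ggsave(png.file, p, width = 11, height = 8.5, units = "in", dpi = 300)
```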
CO2.plot +
scale_colour_manual(values = c("nonchilled" = "red", "chilled" = "blue"))
install.packages("RColorBrewer")
require(RColorBrewer)
display.brewer.all()
CO2.plot + scale_color_brewer(palette = "Dark2")
Wes Anderson colour palette
install.packages("wesanderson")
library(wesanderson)
ggplot(data = iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point(size = 3) +
scale_color_manual(values = wes_palette("GrandBudapest2"))
tidyr to reshape data frames
Wide data format has a separate column for each variable or each factor in your study:

| Species | DBH | Height |
|---|---|---|
| Oak | 12 | 56 |
| Elm | 20 | 85 |
| Ash | 13 | 55 |

Long data format has one column listing the variables and one column holding the values of those variables:

| Species | Measurement | Value |
|---|---|---|
| Oak | DBH | 12 |
| Elm | DBH | 20 |
| Ash | DBH | 13 |
| Oak | Height | 56 |
| Elm | Height | 85 |
| Ash | Height | 55 |
A wide data frame can be used for some basic plotting in ggplot2, but more complex plots require long format (example to come)
dplyr, lm(), glm(), gam() all require long data format
Tidying allows you to manipulate the structure of your data while preserving all original information
gather() - convert from wide to long format
spread() - convert from long to wide format
tidyr installation
install.packages("tidyr")
library(tidyr)
gather columns into rows
(wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                    DBH = c(12, 20, 13), Height = c(56, 85, 55)))
##   Species DBH Height
## 1     Oak  12     56
## 2     Elm  20     85
## 3     Ash  13     55
gather(data, key, value, ...)
- data A data frame (e.g. wide)
- key name of the new column containing variable names (e.g. Measurement)
- value name of the new column containing variable values (e.g. Value)
- ... name or numeric index of the columns we wish to gather (e.g. DBH, Height)
wide
##   Species DBH Height
## 1     Oak  12     56
## 2     Elm  20     85
## 3     Ash  13     55
long <- gather(wide, Measurement, Value, DBH, Height)
long
##   Species Measurement Value
## 1     Oak         DBH    12
## 2     Elm         DBH    20
## 3     Ash         DBH    13
## 4     Oak      Height    56
## 5     Elm      Height    85
## 6     Ash      Height    55
spread rows into columns
long
##   Species Measurement Value
## 1     Oak         DBH    12
## 2     Elm         DBH    20
## 3     Ash         DBH    13
## 4     Oak      Height    56
## 5     Elm      Height    85
## 6     Ash      Height    55
spread(data, key, value)
- data A data frame (e.g. long)
- key Name of the column containing variable names (e.g. Measurement)
- value Name of the column containing variable values (e.g. Value)
wide2 <- spread(long, Measurement, Value)
wide2
##   Species DBH Height
## 1     Ash  13     55
## 2     Elm  20     85
## 3     Oak  12     56
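In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Assuming such a version is installed, the same round trip can be sketched as follows. The object names long.pivot and wide.pivot are illustrative, and the row order of the result may differ from the gather() output above.

```r
library(tidyr)

wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH = c(12, 20, 13), Height = c(56, 85, 55))

# Wide to long: the column names go to Measurement, the cell values to Value
long.pivot <- pivot_longer(wide, cols = c(DBH, Height),
                           names_to = "Measurement", values_to = "Value")

# Long to wide: one new column per distinct value of Measurement
wide.pivot <- pivot_wider(long.pivot, names_from = Measurement,
                          values_from = Value)
wide.pivot
```

gather() and spread() still work, so the rest of this workshop runs unchanged with either pair of functions.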
Using the airquality dataset, gather all the columns (except Month and Day) into rows.
Then spread the resulting data frame to return to the original data format.
?airquality
data(airquality)
Gather all the columns (except Month and Day) into rows:
air.long <- gather(airquality, variable, value, -Month, -Day)
head(air.long)
##   Month Day variable value
## 1     5   1    Ozone    41
## 2     5   2    Ozone    36
## 3     5   3    Ozone    12
## 4     5   4    Ozone    18
## 5     5   5    Ozone    NA
## 6     5   6    Ozone    28
Note that the syntax used here indicates that we wish to gather ALL the columns except Month and Day. It is equivalent to: gather(airquality, variable, value, Ozone, Solar.R, Wind, Temp)
Spread the resulting data frame to return to the original data format:
air.wide <- spread(air.long, variable, value)
head(air.wide)
##   Month Day Ozone Solar.R Temp Wind
## 1     5   1    41     190   67  7.4
## 2     5   2    36     118   72  8.0
## 3     5   3    12     149   74 12.6
## 4     5   4    18     313   62 11.5
## 5     5   5    NA      NA   56 14.3
## 6     5   6    28      NA   66 14.9
separate columns
separate() splits a column by a character string separator
separate(data, col, into, sep)
- data A data frame (e.g. long)
- col Name of the column you wish to separate
- into Names of new variables to create
- sep Character which indicates where to separate
separate() example
Create a fictional dataset about fish and plankton
set.seed(8)
messy <- data.frame(id = 1:4,
trt = sample(rep(c('control', 'farm'), each = 2)),
zooplankton.T1 = runif(4),
fish.T1 = runif(4),
zooplankton.T2 = runif(4),
fish.T2 = runif(4))
messy
##   id     trt zooplankton.T1    fish.T1 zooplankton.T2     fish.T2
## 1  1 control      0.3215092 0.76914695      0.4323914 0.001301721
## 2  2 control      0.7189275 0.64449114      0.5449621 0.264458864
## 3  3    farm      0.2908734 0.45704489      0.1382243 0.276532247
## 4  4    farm      0.9322698 0.08930101      0.9278123 0.521107042
First convert the messy data frame from wide to long format
messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)
##   id     trt           taxa     count
## 1  1 control zooplankton.T1 0.3215092
## 2  2 control zooplankton.T1 0.7189275
## 3  3    farm zooplankton.T1 0.2908734
## 4  4    farm zooplankton.T1 0.9322698
## 5  1 control        fish.T1 0.7691470
## 6  2 control        fish.T1 0.6444911
Then we want to split the 2 sampling times (T1 and T2).
messy.long.sep <- separate(messy.long, taxa,
into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
##   id     trt     species time     count
## 1  1 control zooplankton   T1 0.3215092
## 2  2 control zooplankton   T1 0.7189275
## 3  3    farm zooplankton   T1 0.2908734
## 4  4    farm zooplankton   T1 0.9322698
## 5  1 control        fish   T1 0.7691470
## 6  2 control        fish   T1 0.6444911
The argument sep = "\\." tells R to split the character string around the period (.). We cannot simply type "." because sep is interpreted as a regular expression, in which an unescaped period matches any single character.
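The effect of the escape can be seen with base R's strsplit(), which also treats its separator as a regular expression by default. This is a sketch on a single string; taxa is a hypothetical example value.

```r
taxa <- "zooplankton.T1"

# Escaped period: split on the literal "." character
escaped <- strsplit(taxa, "\\.")[[1]]
escaped

# Unescaped period: the pattern matches every character,
# so only empty strings are left
unescaped <- strsplit(taxa, ".")[[1]]
unescaped

# fixed = TRUE treats the separator as plain text, so no escape is needed
literal <- strsplit(taxa, ".", fixed = TRUE)[[1]]
literal
```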
tidyr
A package that reshapes the layout of data sets.
Converting from wide to long format using gather()
Converting from long format to wide format using spread()
Split and merge columns with unite() and separate()
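unite() is the reverse of separate(): it pastes several columns into one. The following is a minimal sketch on a small made-up data frame; the object names separated and united are illustrative only.

```r
library(tidyr)

# Hypothetical data frame with the taxon and sampling time in separate columns
separated <- data.frame(species = c("zooplankton", "fish"),
                        time = c("T1", "T2"),
                        count = c(0.32, 0.77))

# Merge species and time into a single column named taxa
united <- unite(separated, taxa, species, time, sep = ".")
united
```

By default the input columns are removed from the result, so united keeps only taxa and count.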
ggplot and tidyr
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
The dataset is in wide format, where measured variables (Ozone, Solar.R, Wind and Temp) are each in their own columns.
Let's use ggplot to visualize each individual variable and the range it displays for each month in the time series
fMonth <- factor(airquality$Month) # Convert the Month variable to a factor.
ozone.box <- ggplot(airquality, aes(x = fMonth, y = Ozone)) + geom_boxplot()
solar.box <- ggplot(airquality, aes(x = fMonth, y = Solar.R)) + geom_boxplot()
temp.box <- ggplot(airquality, aes(x = fMonth, y = Temp)) + geom_boxplot()
wind.box <- ggplot(airquality, aes(x = fMonth, y = Wind)) + geom_boxplot()
You can use grid.arrange() in the package gridExtra to arrange the 4 separate plots into one panel for viewing.
library(gridExtra)
combo.box <- grid.arrange(ozone.box, solar.box, temp.box, wind.box,
                          nrow = 2)
# nrow = number of rows you would like the plots displayed on
Convert airquality to long format to use facet_wrap() for the different variables as opposed to by month
air.long <- gather(airquality, variable, value, -Month, -Day)
fMonth.long <- factor(air.long$Month)
weather <- ggplot(air.long, aes(x = fMonth.long, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, nrow = 2)
facet_wrap() puts all the individual variables on the same scale, which can be useful in many cases. Here, however, we can't see the variability in the Wind and Temp variables. We can free the axis scales in each panel using scales = "free"
weather <- weather + facet_wrap(~ variable, nrow = 2, scales = "free")
We can also use the long data format to create a plot with all the variables included on a single plot
w2 <- ggplot(air.long, aes(x = Day, y = value, colour = variable)) +
  geom_point() + # put all day measurements on one plot
  facet_wrap(~ Month, nrow = 1) # split observations by month
dplyr
Some corresponding R base functions: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()
install.packages("dplyr")
library(dplyr)
These 4 core functions tackle the most common manipulations when working with data frames
select(): select columns from a data frame
filter(): filter rows according to defined criteria
arrange(): re-order data based on criteria (e.g. ascending, descending)
mutate(): create or transform values in a column
select columns
select(data, ...)
... Can be column names or positions or complex expressions separated by commas
select(data, column1, column2) # select columns 1 and 2
select(data, c(2:4,6)) # select columns 2 to 4 and 6
select(data, -column1) # select all columns except column 1
select(data, starts_with("x.")) # select all columns that start with "x."
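The selection styles listed above can be tried on a small made-up data frame. In this sketch, toy and its column names are hypothetical and chosen only to illustrate each form.

```r
library(dplyr)

# Hypothetical data frame for illustration
toy <- data.frame(id = 1:3, x.a = 4:6, x.b = 7:9, y = 10:12)

by.name     <- select(toy, id, y)              # columns given by name
by.position <- select(toy, c(2:3))             # columns given by position
all.but.id  <- select(toy, -id)                # every column except id
by.prefix   <- select(toy, starts_with("x."))  # columns whose name starts with "x."

names(by.prefix)
```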
Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset
ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
##   Ozone Month Day
## 1    41     5   1
## 2    36     5   2
## 3    12     5   3
## 4    18     5   4
## 5    NA     5   5
## 6    28     5   6
filter rows
Extract a subset of rows that meet one or more specific conditions
filter(dataframe, logical statement 1, logical statement 2, ...)
Example: we are interested in analyses that focus on the month of August during high temperature events
august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
##   Ozone Solar.R Wind Temp Month Day
## 1    89     229 10.3   90     8   8
## 2   110     207  8.0   90     8   9
## 3    NA     222  8.6   92     8  10
## 4    76     203  9.7   97     8  28
## 5   118     225  2.3   94     8  29
## 6    84     237  6.3   96     8  30
arrange
Re-order rows by a particular column, by default in ascending order
Use desc() for descending order.
arrange(data, variable1, desc(variable2), ...)
Example: 1. Let's use the following code to create a scrambled version of the airquality dataset
air_mess <- sample_frac(airquality, 1)
head(air_mess)
##     Ozone Solar.R Wind Temp Month Day
## 35     NA     186  9.2   84     6   4
## 63     49     248  9.2   85     7   2
## 93     39      83  6.9   81     8   1
## 33     NA     287  9.7   74     6   2
## 99    122     255  4.0   89     8   7
## 145    23      14  9.2   71     9  22
Example: 2. Now let's arrange the data frame back into chronological order, sorting by Month then Day
air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
Try: arrange(air_mess, Day, Month) and see the difference
mutate
Compute and add new columns
mutate(data, newVar1 = expression1, newVar2 = expression2, ...)
Example: we want to convert the temperature variable from degrees Fahrenheit to degrees Celsius
airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
head(airquality_C)
##   Ozone Solar.R Wind Temp Month Day   Temp_C
## 1    41     190  7.4   67     5   1 19.44444
## 2    36     118  8.0   72     5   2 22.22222
## 3    12     149 12.6   74     5   3 23.33333
## 4    18     313 11.5   62     5   4 16.66667
## 5    NA      NA 14.3   56     5   5 13.33333
## 6    28      NA 14.9   66     5   6 18.88889
magrittr
Data manipulation usually requires multiple steps. The magrittr package offers a pipe operator %>% which allows us to link multiple operations
install.packages("magrittr")
require(magrittr)
Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned
june_C <- mutate(filter(airquality, Month == 6),
Temp_C = (Temp-32)*(5/9))
As we add more operations, wrapping functions one inside the other becomes increasingly illegible. Running the operations step by step instead would be redundant and would write many intermediate objects to the workspace.
Alternatively, we can use magrittr's pipe operator to link these successive operations
june_C <- airquality %>%
filter(Month == 6) %>%
mutate(Temp_C = (Temp-32)*(5/9))
Advantages:
less redundant code
easy to read and write because functions are executed in order
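As a quick check that the three writing styles discussed above produce the same data frame, here is a sketch comparing them. The names nested, june, stepwise and piped are illustrative; dplyr makes the %>% operator available once it is loaded.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- mutate(filter(airquality, Month == 6),
                 Temp_C = (Temp - 32) * (5 / 9))

# Step by step: readable, but leaves an intermediate object behind
june <- filter(airquality, Month == 6)
stepwise <- mutate(june, Temp_C = (Temp - 32) * (5 / 9))

# Piped: read from top to bottom, no intermediate object
piped <- airquality %>%
  filter(Month == 6) %>%
  mutate(Temp_C = (Temp - 32) * (5 / 9))

identical(nested, piped)
identical(stepwise, piped)
```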
dplyr::group_by and summarise
The dplyr verbs become especially powerful when they are combined using the pipe operator %>%. The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually
group_by() : group data frame by a factor for downstream operations (usually summarise)
summarise() : summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)
dplyr - Split-Apply-Combine
The group_by function is key to the Split-Apply-Combine strategy
Example: we are interested in the mean temperature and standard deviation within each month of the airquality dataset
month_sum <- airquality %>%
group_by(Month) %>%
summarise(mean_temp = mean(Temp),
sd_temp = sd(Temp))
month_sum
## # A tibble: 5 x 3
##   Month mean_temp sd_temp
##   <int>     <dbl>   <dbl>
## 1     5      65.5    6.85
## 2     6      79.1    6.60
## 3     7      83.9    4.32
## 4     8      84.0    6.59
## 5     9      76.9    8.36
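Temp has no missing values, but other airquality columns such as Ozone do, and mean() returns NA for any group that contains one. The sketch below applies the same group_by() and summarise() pattern to Ozone and drops the missing values explicitly. The names ozone_sum, n_obs, n_missing and mean_ozone are illustrative.

```r
library(dplyr)

ozone_sum <- airquality %>%
  group_by(Month) %>%
  summarise(n_obs = n(),                             # rows in each month
            n_missing = sum(is.na(Ozone)),           # missing Ozone values
            mean_ozone = mean(Ozone, na.rm = TRUE))  # mean of the non-missing values
ozone_sum
```

Reporting the number of missing values next to the mean makes it clear how many observations each monthly average is based on.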
Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.
Employ dplyr verbs and the %>% operator.
weight_diff <- ChickWeight %>%
group_by(Chick) %>%
summarise(weight_diff = max(weight) - min(weight))
head(weight_diff)
## # A tibble: 6 x 2
##   Chick weight_diff
##   <ord>       <dbl>
## 1 18           4.00
## 2 16          16.0
## 3 15          27.0
## 4 13          55.0
## 5 9           58.0
## 6 20          76.0
Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.
Employ dplyr verbs and the %>% operator.
(Hint: first() and last() may be useful here.)
diet_summ <- ChickWeight %>%
group_by(Diet, Chick) %>%
summarise(weight_gain = last(weight) - first(weight)) %>%
group_by(Diet) %>%
summarise(mean_gain = mean(weight_gain))
diet_summ
## # A tibble: 4 x 2
##   Diet   mean_gain
##   <fctr>     <dbl>
## 1 1            115
## 2 2            174
## 3 3            230
## 4 4            188